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SECTION  1 
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BACKGROUND  & OVERVIEW 


Components  which  were  only  a very  short  time  ago  thought  to  be  beyond 
the  capabilities  of  the  semiconductor  manufacturers  are  almost  daily 
becoming  a reality  through  rapidly  developing,  highly  sophisticated  inte- 
grated circuitry  techniques.  The  exponential-like  increase  of  capabilities 
in  LSI  and  soon,  VLSI,  coupled  with  parallel  evolution/maturation  of 
computer  science  itself,  has  created  and  will  continue  to  create  new  forms 
of  data  and/or  signal  processing  equipment  at  an  ever  accelerating  rate. 
These  changes  continue  to  influence  the  potential  "payoffs”  of  distributed 
process! 

The  cxity  and  perhaps  novelty  of  many  of  these  processing  systems 

is  beyond  the  scope  of  this  report  and  often  may  preclude  categorization 
altogether.  However,  an  attempt  should  be  made  to  categorize  and  to  a 
lesser  degree,  catalogue  at  least  generic  architectures  in  order  to  obtain 
a firmer  grasp  of  the  meaning  of  distributed  processing. 

In  the  context  of  this  report,  computer  architecture  is  distinguished  from 
hardware  implementation  detail,  and  therefore  is  defined  as  "the  structure 
of  the  computer  which  a machine-level  programmer  needs  to  know  in  order 
to  write  programs  which  will  run  correctly  on  that  computer.  " 


''I 

j 

I 

t 
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The  architecture  thus  consists  of  the  instruction  set,  the  address  space, 
and  the  registers  which  are  made  available  or  "visible"  to  the  machine 


language  programmer.  In  general,  nothing  which  is  time-dependent  is 

considered  (strictly)  part  of  an  architecture.  Comparing  computers  is  | 

much  different  than  comparing  architectures,  as  will  become  very  clear 

shortly.  j 

.1 
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SECTION  2 


THE  CLASSIFICATION  AND  EVOLUTION 
OF  DATA  PROCESSING  SYSTEM  ARCHITECTURES 


Figure  1 shows  a tree  structure  which  can  be  used  to  depict  the  decision 
space  within  which  one  can  classify  most  architectures.  The  first  decision 
is  single  control  stream,  multiple  control  stream  (MCS)  with  no  inter- 
action, or  MCS  with  interaction.  The  concept  of  stream  is  used  in  this 
report  to  mean  a sequence  of  items  (data  or  instructions)  which  are  exe- 
cuted or  operated  on  by  a processor.  [FLYN  66],  [FLYN  72] 


MD 


Figure  1.  Taxonomic  Tree 


3 


] 


I 

1 

A single  control  stream  implies  a single  source  of  instructions.  At  the 

instruction  stream  level,  the  instruction  may  cause  more  than  one  machine-  | 

level  instruction  to  occur  (in  other  words,  the  single  instruction  stream 

and  multiple  instruction  stream).  Multiple  control  stream  then  suggests 

multiprocessors  or  special  machines  with  more  than  one  source  for  high- 

level  instructions.  | 

If  these  multiple  control  streams  interact  with  each  other  (i.  e. , cooperate  j 

or  communicate),  the  right-most  branch  of  Figure  1 is  invoked.  The  most 

complex  branching  in  the  tree,  at  the  MCS  with  interaction,  is  not  covered  j 

in  detail  here.  A more  complete  treatment  of  these  branches  anJ  leaves 
may  be  found  in  "Computer  Interconnection  Structures:  Taxonon.v  Charac- 
\ teristics,  and  Examples.  " [ANDE  75]  This  taxonomy  is  a detailed  dis-  . 

I cussion  of  a complex  four-level  tree  which  culminates  in  ten  "leaves.  " 1 

' Each  leaf  is  in  turn  a generic  type  of  network  or  computer  interconnection  . 

scheme  (e.g.,  star,  global  bus,  ring,  etc.).  J 

The  single  control  stream  branches  ultimately  to  SISD  (single  instruction  1 

stream,  single  data  stream),  SIMD  (single  instruction  stream,  multiple  i 

data  stream),  MISD  and  MIMD  type  architectures.  [FLYN  72]  Figure  2 1 

, shows  pictorially  what  these  concepts  are.  A SISD  machine,  commonly  j 

known  as  an  uniprocessor,  has  a number  of  programs  P^,  P^.  . . Pn  opera-  ' 

ting  sequentially  on  data  sets  D^,  D^.  . . Dn.  The  SIMD  approach,  typified  i 

by  ILLIAC  IV  (array)  and  PEPE  (ensemble)  has  a single  program  opera-  J 

ting  on  an  array  of  data  sets.  (The  array  and  ensemble  configuration  will  i 

be  discussed  later. ) We  are  not  currently  aware  of  an  example  of  a single  ' 

control  stream,  multiple  instruction  stream,  multiple  data  stream  archi-  j 


lecture.  MISD  is  commonly  interpreted  as  the  pipeline  case,  typified  by 
the  TI-ASC, 

Figure  3 and  Table  1 show  graphically  how  an  evolution  has  taken  place  in 
computer  architecture.  Three  decades  ago  computers,  as  most  people 
think  of  them  (von  Neumann)  [Burk  46]  [WLLK  51],  were  just  being 
developed.  Five  years  prior  to  that,  electromechanical  nonstored  program 
machines  were  built  and  used  for  the  military  at  such  places  ^s  Harvard 
University  and  Bell  Labs.  /' 


Figure  3.  Uniprocessor  Capability  versus  User  Requirements 
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Table  1.  Data  Processing  Technology  Trends 


This  may  consist  of  a combination  of  any  or  all  previous  architecture  types 
integrated  at  one  or  more  levels  of  abstraction  for  a specific  application 
problem  and  its  associated  objectives.  BMD's  need  is  for  problem-driven 
architectures  (PDAs)  rather  than  for  distributed  processing  architectures 
(in  the  usual  computer  science  sense  of  the  term). 


Figure  3 shows  that  in  the  early  days  of  EDP,  the  data  processing  systems 
were  more  than  capable,  albeit  crude  by  modern  standards.  As  time  and 
technology  marched  on,  computers  with  increased  capability  became  avail- 
able, but  unfortunately,  computational  loading  caught  up  quite  fast  with  the 
SISD  machines.  Around  1960,  users  began  to  turn  to  multiprocessor 
systems  for  their  required  throughput  and  reliability.  By  1964,  Control 
Data  Corporation  was  delivering  the  CDC  6600,  which  provided  pipelining 
of  instructions  and  used  multiple  (ten)  arithmetic  units  to  achieve  a speed 
increase  of  more  than  two  and  one-half  times  that  of  the  CDC  6400  (which 
had  only  one  comparable  arithmetic  unit).  [BELL  71]  The  first  CDC  7600 
was  delivered  in  1969,  and  eventually  in  the  early  1970's,  the  CDC  STAR  100 
emerged.  Each  machine  delivered  marked  increases  in  performance  over 
its  predecessor. 

Control  Data  Corporation  was  not  alone  in  the  evolutionary  period,  but 
the  6600,  7600  and  STAR  are  prime  examples  as  well  as  well-known  machines. 
The  late  1960’s  produced  the  ILLIAC  IV,  which  was  an  outgrowth  of  the 
SOLOMON  work  done  by  Slotnick.  [SLOT67],  fSLOT62]  More  recent  devel- 
opments have  been  PEPE,  TI  ASC  and  the  CRAY -I,  to  name  a few.  These 
systems  are  motivated  by  two  main  forces:  1)  the  need  for  large  systems  to 
perform  special  processing,  and  2)  the  need  to  take  advantage  of  new  LSI 
technology  as  it  emerges  by  utilizing  repetitive  system  organizations. 

Figure  3 shows  a continuing  divergence  between  SISD  performance  and  user 
requirements,  although  both  are  increasing.  The  only  way  to  narrow  the 
gap  will  be  a continued  emphasis  on  distributed  processing  systems, 
coupled  with  continued  growth  in  integrated  circuit  technology. 


SECTION  3 


EXAMPLES,  COMPARISONS,  AND  CONSIDERATIONS 


This  section  discusses  and  shows  by  example  the  characteristics  of  the 
SISD,  SIMD,  MISD,  and  MIMD  machine  types.  It  also  discusses  the  dif- 
ferences within  an  architecture  (e.  g, , speed,  pxjwer,  cost)  and  discusses 
the  circumstances  under  which  one  architecture  or  specific  machine  within 
a class  might  be  chosen  over  another. 

UNIPROCESSORS 

When  considering  the  SISD  machines  as  a family,  one  finds  a nearly  com- 
plete spectrum  of  performance  parameters,  prices,  and  simple  physical 
size.  Strictly  speaking,  the  term  uniprocessor  thus  applies  to  something 
as  small  and  relatively  primitive  as  an  INTEL  4004,  as  well  as  the  very 
large  and  highly  sophisticated  Burroughs  B7800. 

When  removed  from  a relatively  simple  batch  mode  environment,  the  SISD 
machines  have  several  severe  drawbacks.  In  general,  uniprocessors  are 
not  as  fault- tolerant  as  multiple  processor  architectures.  Although  some 
of  the  large  machines  may  have  a certain  amount  of  error  correction  logic 
in  memory  and  distributed  throughout  the  system  (e.  g. , B7700),  they  do 
not  truly  allow  for  graceful  degradation.  If  the  CPU  (or  a single  main 
memory  module)  fails,  it  is  a catastrophic  failure. 


SISD  machines,  although  often  used  for  real-time  control  applications,  are 
far  from  ideal  for  this  type  of  application  and  usually  require  tricky  multi- 
programming techniques.  The  costs  of  programming  a single  SISD  machine 
to  do  a job  equivalent  to  that  of  a relatively  simple  distributed  system  be- 
come outrageous  (e.  g. , the  AN/UYK-16  used  in  the  AN/SQS-56  sonar). 

Expandability  is  generally  relegated  to  a simple  but  often  expensive  trade 
upwards.  Unless  the  user  sticks  to  the  same  vendor,  however,  he  is 
usually  faced  with  incredible  software  and  I/O  compatibility  problems  in 
this  trade. 

The  general  comprehensibility  of  a uniprocessor  and  the  relatively  simple 
executive  program  are  the  major  reasons  why  there  is  and  will  continue 
to  be  much  inertia  to  ^tay  with  the  monoliths. 

Microprocessors 

Microprocessors  are  presently  divided  into  three  generations,  all  of  which 
are  currently  popular.  The  generation  division  is  made  largely  by  hard- 
ware technology.  First  generation  microprocessors  are  usually  PMOS, 
four-bit,  calculator-oriented  devices.  They  have  typically  up  to  a 4K-word 
address  capacity  and  are  very  slow  (e.g. , 62.  5 microsecond  register  add 
for  a Fairchild  PPS-25),  but  they  are  cheap  and  generally  well-suited  for 
such  uses  as  point-of-sale  terminals  or  hand  calculators.  [OGDE  74], 
[CUSH  75] 


Second  generation  micros  are  generally  NMOS  or  CMOS,  8-bit  word  size, 
faster,  and  have  a larger  address  space  (e.g. , RCA  COSMAC  has  64K-word 
range  and  a six-microsecond  add  time).  [SWAL  74] 

The  third  generation  is  typified  by  the  AM2900  family.  This  generation 
differs  not  only  in  the  IC  technology  which  is  typically  bipolar  (TTL,  IIL, 
ECL),  but  also  in  that  they  have  been  "sliced"  vertically,  usually  into  4-bit 
slices.  These  processors  are  fast  due  to  their  bipolar  nature,  but  are 
also  power-hungry  relative  to  previous  generations.  This  has  resulted  in 
less  logic  per  chip,  which  in  turn  has  resulted  in  the  slice  concept  and, 
more  significantly,  a microprogrammable  microprocessor.  One  might 
say  the  microprogrammability  of  the  bit  slice  processors  is  a fortuitous 
consequence  of  a flaw  in  the  bipolar  technology.  Since  there  is  less  hard- 
ware on  the  chips,  control  cannot  be  very  complicated,  and  therefore,  a 
good  deal  of  emulation  of  the  control  via  firmware  is  called  for.  Micro- 
programmability makes  these  chip  sets  the  most  flexible  of  all  the  micro- 
processors. 

Table  2 summarizes  the  progression  in  IC  technology.  It  should  be  noted, 
however,  that  these  technologies  are  not  confined  to  microprocessors.  Any 
technology  available  for  microprocessor  fabrication  (LSI)  can  naturally  be 
used  for  custom  LSI  or  even  MSI  to  implement  anything. 

Minicomputers 

Minicomputers  may  have  the  widest  range  of  performance  parameters  of 
any  class  of  uniprocessors.  At  the  low  end,  minis  are  often  outperformed 
by  high  end  micros.  At  the  high  end,  a mini  overlaps  medium-scale 
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Table  2.  Microprocessor  Fabrication  Processes 


f 


computer  performance  (e.  g. . PDP-11/60,  PDP-11/70).  Also,  just  as 
microprocessors /microcomputers  (there  is  a difference)  come  either 
microprogrammed  (i.  e. , user  has  a fixed  instruction  set),  microprogram- 
mable  (user  makes  up  firmware  to  emulate  target  instruction  set)  or  hard- 
wired (special  purpose),  so  do  minicomputers.  Most  newer  minis  have 
microprogrammed  control  store  and  some  have  a portion  of  the  control 
store  set  aside  for  user  customization  of  the  instruction  set  (e.g. , 
AN/UYK-20,  Interdata  8/32,  Varian  73,  etc. ). 


Instruction  execution  times  vary  from  less  than  300  nanoseconds  to  over 
one  microsecond.  Memory  addressing  schemes  include  memory  manage- 
ment, virtual  memory,  and  paging,  among  others.  Physical  memory  can 
range  from  as  little  as  8K  to  as  many  as  a million  words.  Optional  features 
abound,  often  including  exotic  memory^ chemes,  variations  on  I/O,  DMA 
and  interrupts.  Floating  point  arithmetic  is  becoming  popular  as  either 
standard  at  the  upper  end  of  the  lines  or  an  add-on  option  (effectively  a 
peripheral)  at  the  mid  or  lower  end.  Hardware  multiply  and  divide  are 
often  available,  whereas  current  microprocessors  (with  perhaps  one  or 
two  exceptions)  do  not  supply  this  function. 

The  minicomputer  family  is  by  far  the  largest  and  most  widely  used  group 
of  SISD  machines  currently  on  the  market.  A mini  may  be  relegated  to 
such  relatively  menial  tasks  as  supporting  a small  service  bureau,  or  it 
may  be  part  of  a complex  real-time  processing  or  process  control  system. 
Groups  of  minis  are  often  used  in  networks  (such  as  C.  mmp  at  Carncgie- 
Mellon  University)  to  achieve,  through  distributed  processing,  the  power 
of  a single  "supercomputer"  at  a fraction  of  the  cost  and  at  a much  greater 
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level  of  availability.  Such  applications  address  difficult  problems,  such 
as  speech  recognition,  nuclear  reaction  calculations,  etc. 

Mil  Spec  Computers — Generally,  military  computers  (e.g., 

AN/UYK-20.  AN/AYK-14.  AN/UYK-16,  AN/UYK-30,  etc. ) are  mini- 
computers which  have  been  repackaged  to  meet  military  environmental 
requirements.  (For  instance,  there  is  no  difference,  architecturally, 
between  a Raytheon  704  and  an  AN/UYK-16. ) 

The  specifications  imposed  by  the  military  on  computer  manufacturers  are 
generally  mechanical  and  electrical  rather  than  architectural.  Certain 
I/O  protocols  maybe  specified  (e.g.,  NTDS  for  UYK-20),  but  in  general, 
the  emphasis  is  on  reliability,  physical  size,  weight,  and  power  consump- 
tion, as  well  as  other  environmental  considerations,  such  as  shock  and 
vibration,  temperature.  EMI  shielding,  and  radiation. 

Medium-Scale  Computers 

Medium-scale  computers  fill  a rapidly  dwindling  niche  between  the  mini- 
computer and  the  large-scale  computer.  Current  top -of -the -line  mini- 
computers are  overlapping  the  medium-scale  computers  at  the  lower  end, 
and  many  of  the  large-scale  computers  are  more  cost-effective  than  the 
top  end  medium-scale  computer.  An  archetypical  medium-scale  computer 
might  be  the  very  successful  Burroughs  B3700. 


The  patriarchs  of  the  SISD  family  are  the  large-scale  computers,  the  DEC 
SYSTEM  10,  Burroughs  B6700  and  up.  These  types  of  machines  have  a 
basic  clock  of  8 to  20  megacycles,  and  one-  or  two-million-word  (48  to 
64  bit)  memories,  usually  backed  up  by  a large  amount  of  disk  and  tape 
memory.  They  have  large  very  complex  master  control  programs  and  are 
usually  made  to  accept  higher  level  languages. 

An  exhaustive  poll  or  study  of  the  market  will  indicate  that  there  is  no 
standard  by  which  to  measure  large  scale  CPU's.  Comparisons  may  be 
made  between  models  from  the  same  manufacturer.  Major  mainframe 
manufacturers  often  compare  their  product  against  specific  models  of  a 
competitor's  line.  Generally,  each  manufacturer  is  strong  somewhere 
where  most  of  the  competition  is  weaker,  but  invari'-bly  falls  short  else- 
where. 

Several  key  characteristics  seem  to  be  more  prone  to  comparison 
than  most.  Since  these  several  characteristics  appear  foremost  in  the 
minds  of  many  users,  they  may  be  used  to  establish  what  is  generally 
meant  by  "large-scale  computing  power.  " 

Basic  architecture  is  not  a characteristic  of  concern  here,  since  on  the 
grand  scale,  essentially  all  manufacturers  stick  to  the  general  von  Neu- 
mann architecture.  The  implementation  of  that  general  architecture  is 
where  one  finds  differences  of  significance.  The  areas  within  an  archi- 
tecture which  appear  to  be  performance  adjustors  are;  1)  main  storage; 
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2)  central  processing  unit  registers  and  data  paths;  3)  sequence  control; 
and  of  course,  4)  input/output.  As  was  stated  earlier,  a deficiency  in 
one  characteristic  is  often  compensated  for  by  excellence  in  another. 

Main  storage  is  the  area  which  most  affects  the  other  performance  ad- 
justments. It  is  in  itself  adjustable  in  several  ways,  each  with  major 
and  subtle  effects  elsewhere. 

The  speed  of  main  storage  is  the  most  obvious  characteristic  or  variable 
to  consider  when  looking  at  a machine.  Typical  main  memory  for  a large 
processing  system  currently  in  common  use  is  core.  Read-write  core 
typically  runs  from  a 2.  5 microsecond  cycle  time  down  to  much  less  than 
1 microsecond.  There  are  faster  storage  devices  of  course  (e.  g. , solid- 
state  memory),  but  they  are  not  currently  typically  found  as  main  memory 
in  many  large  existing  installations. 

Word  size  is  also  critical  when  classifying  a computing  machine;  the 
more  information  a word  contains,  the  more  data  the  machine  makes 
available  to  process  per  fetch,  and  the  faster  it  can  store  away  results. 
Word  size  ranges  typically,  in  large  machines,  between  32  and  64  bits. 

More  subtle  are  the  ramifications  of  sheer  volume  of  high-speed,  first- 
level  memory.  The  more  high-speed  local  memory  available,  the  fewer 
accesses  the  processor  must  make  to  other  slower  external  storage  media. 

The  second  performance  parameter  (CPU)  also  has  several  subfactors 
which  determine  its  general  characteristics.  Most  obvious  of 
these  factors  is  speed.  The  fundamental  determinant  of  CPU 
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speed  is  the  time  required  to  take  data  from  the  internal  registers,  process 
the  data,  and  get  the  result  to  a register.  This  is  determined  by  a number 
of  factors,  such  as  circuit  family,  location  of  general  floating  registers, 
as  well  as  the  width  and  speed  of  these  registers.  CPU  cycle  time  for  a 
large  machine  is  typically  between  1 microsecond  (for  older  machines) 
and  20  nanoseconds. 

Sequence  control,  like  the  CPU,  has  a cycle  time.  It  too  is  dependent 
upon  several  variables,  including  circuit  family.  However,  sequence 
control  speed  is  also  dependent  upon  the  extent  to  which  microprogrammed 
sequence  control  is  used  relative  to  sequential  logic.  Once  again,  a 
25-to-l  ratio  can  be  seen.  An  older,  slower,  large  machine  may  take  up 
to  1 microsecond  to  execute  a basic  operation,  where  a faster  machine 
with  little  or  no  microprogrammed  instruction  may  run  as  fast  as  20  nano- 
seconds. However,  the  cost  penalty  paid  for  the  speed  may  be  substantial. 
The  recurring  cost  of  the  fast  sequential  logic,  as  well  as  complexity  of 
design  and  maintenance,  maybe  greater  than  the  25-to-l  speed-up  it 
produces. 

Input/output  control  or  processing  is  an  outstanding  characteristic  or 
parameter  determining  machine  throughput.  In  most  cases,  a manu- 
facturer builds  a general  set  of  peripheral  devices  with  controllers  which 
he  will  use  with  machines  having  a wide  range  of  performance.  The 
difference  in  I/O  rate  between  machines  occurs  for  two  reasons.  The 
most  obvious  is  bus  width  between  processor  and  peripheral  device.  The 
most  important  difference,  however,  is  the  amount  of  hardware  devoted  to 
I/O  tasks.  In  a basic  machine,  the  CPU  communicates  directly  with  the 
peripherals  and  must  stop  all  other  processing  to  do  so.  If  this  were  the 


case  for  a large  machine,  given  the  other  previously  established  criteria, 
we  would  call  the  machine  "l/O-bound.  " At  the  opposite  end  of  the  spec- 
trum, the  very  large-scale  machines  such  as  the  B7700  essentially  free 
the  CPU  from  doing  I/O  by  having  one  or  more  special  purpose  I/O  pro- 
cessor. Again,  a 5-to-l  ratio,  2 million  bits  per  second,  and  up  to  10 
million  or  more  bits  per  second,  are  commonly  attained. 

Large-scale  computers  (SISD)  are  used  for  a variety  of  tasks  but  not  nearly 
as  commonly  as  the  minicomputer.  No  doubt,  the  largest  user  of  large- 
scale  computers  is  the  U.  S.  Government.  Banks  and  insurance  com- 
panies, as  well  as  most  big  business  organizations,  employ  large-scale 
computers  and  sometimes  networks  of  large  scale  computers  to  keep 
records,  control  complex  plant  automation,  forecast  business  trends,  etc. 

But  for  the  extremely  large  scientific  applications,  large-scale,  general- 
purpose  SISD  machines,  even  in  multiprocessor  configurations,  are  in- 
adequate. 

CONCURRENT  PROCESSING  ARCHITECTURES 
The  Spectrum  of  Distribution 

Distributed  processing  means  many  things  to  many  people,  unfortunately. 

It  is  still  in  1977  a fairly  esoteric  discipline  with  little,  if  any,  standards, 
although  the  Distributed  Processing  workshop  held  at  Brown  University  in 
August  1976  and  August  1977  and  a wealth  of  other  gatherings  of  top  scientists 
involved  in  distributed  processing  are  attempting  to  codify  this  nebulous  area. 
[CAN  76] 


Distribution  may  occur  at  many  levels.  One  can  visualize  a spectrum 
with  the  ARPANET  at  one  extreme  and  a processor  such  as  the  RAM-1  at 
the  other.  [RAM  76] 

ARPANET  by  some  definitions  is  not  distributed  processing  at  all,  but 
by  others  it  may  be  considered  a very  loosely  coupled,  geographically 
distributed,  heterogeneous  network.  In  some  sense  then,  it  is  distributed 
processing  where  there  is  no  unit  of  cooperation  but  simply  communication 
between  the  nodes  in  the  network.  [BELL  74] 

Close  to  the  opposite  end  of  the  spectrum  is  C.  mmp — a very  tightly 
coupled  homogeneous  network  using  parallel  processing  to  achieve  its 
throughput  objectives.  C.  mmp  divides  single  programs  into  pieces  and 
parses  the  pieces  out  to  several  cooperating  parallel  processors  (PDP-11 ’s). 
[BELL  72],  [WULF  75] 

Taking  this  concept  one  step  further,  processors  like  the  RAM-1,  de- 
signed at  Brown  University  and  the  Moore  School  of  Electrical  Engineer- 
ing at  the  University  of  Pennsylvania,  will  probably,  by  1983,  be  as 
commonly  used  as  today's  PDP-ll*s.  The  RAM-1  uses  distributed  pro- 
cessing as  an  architectural  tool- -distributing  at  the  lowest  possible  (i.  e. , 
fixed  function)  level  where  the  distribution  problems  are  hidden  from  the 
user.  Through  additional  architectural  techniques,  a fairly  high- 
performance  machine  can  be  realized  out  of  a network  of  bit-slice 


microprocessors.  So  the  unit  of  cooperation  in  C.mmp  is  a phase  of 
a program,  and  the  unit  of  cooperation*  in  a RAM-1  type  configuration 
is  the  individual  instruction. 

Functional  distribution  is  of  greatest  interest  and  common  usage  now.  As 
was  pointed  out  above,  functional  distribution  may  occur  at  various  levels. 
Furthermore,  a distributed  processor  may  be  general  purpose  or  special 
purpose,  and  in  either  case,  the  individual  processors  which  make  up  the 
processor  may  be  very  much  heterogeneous  or  homogeneous. 

It  ib  worthwhile  to  note  that  although  essentially  all  well-known  concurrent 
and/or  distributed  processing  systems  achieve  or  surpass  large-scale 
SISD  performance,  it  is  not  the  case  that  these  classes  of  architectures 
(SIMD,  MISD,  MIMD)  are  necessarily  high-performance  machines.  That 
is,  it  is  by  design  and  not  generic  definition  that  the  machines  commonly 
thought  of  as  concurrent  and/or  distributed  processors  generally  out- 
perform the  SISD  machines, 

SIMD  Machines 

In  the  early  1960's,  when  it  became  apparent  that  the  scientific  community 
was  beginning  to  formulate  problems  too  large  and  complex  for  large 
SISD  machines  and  even  large-scale  multiprocessor  machines,  the  SIMD 


*The  term  "unit  of  cooperation"  as  used  here  is  related  to  the  term  "homo- 
geneous problem"  as  referred  to  in  C,  Vick's  definition  of  distributed 
processing. 
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architectures  began  to  take  more  form.  SIMD  was  not  a radical  new 
idea  in  the  l960's,  but  the  thought  of  actually  implementing  a fast  giant 
like  ILLIAC  IV  seemed  like  science  fiction.  Most  of  the  problems  of 
implementing  and  operating  ILLIAC  were  based,  again,  on  semiconductor 
manufacturers'  problems.  ILLIAC  could  probably  be  rebuilt  today  with 
current  LSI  technology  and  would  result  in  a much  more  effective,  reliable 
machine.  (Of  course,  there  would  still  be  the  problem  of  figuring  out  how 
to  use  it. ) 

The  SIMD  machine  organization  has  N processors  controlled  by  a single 
controlling  processor.  These  N processing  elements  will  work  in  parallel, 
all  (or  a selected  subset)  performing  the  same  operation  simultaneously 
on  different  data.  All  nonselected  processors  do  nothing.  They  may  share 
the  same  single  memory  or  a set  of  N independent  memories,  the  memories 

I » 

j j together  making  up  the  main  memory  For  the  computer. 

i ' 

•'  I With  N processors,  it  is  better  to  have  N independent  memories  because 

each  processor  can  have  immediate  access  to  a location  for  storage  of 
■ frequently  used  programs  and  results.  Having  N memories  resolves  some 

of  the  conflicts  created  by  N processors  requiring  memory  access 
to  a single  memory.  The  memories  could  be  connected  through  a cross- 
bar, but  the  delay,  the  complexity,  and  the  cost  for  conflict  resolution 
for  such  a system  would  be  counterproductive  for  large  N (since  the 
i number  of  active  elements  and  links  increases  as  N ).  [SULL  771 

! Because  of  this,  in  most  parallel  computers,  each  memory  unit  is 

! associated  with  one  of  the  processing  elements,  and  only  that  processing 

i 
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element  has  immediate  access  to  the  memory.  Therefore,  there  must  be 
interprocessor  communication  to  get  a memory  reference  from  one  memory 
element  to  a processing  element  not  associated  with  that  memory  element. 

Processor  connections  are  needed  for  such  interprocessor  communication. 
The  kind  of  connection  used  affects  the  way  the  computer  solves  certain 
problems.  Some  connection  schemes  are  better  suited  for  solving  one 
type  of  problem  than  another. 

One  interconnection  pattern  known  to  be  good  for  a broad  class  of  problems 
is  the  cyclic  shift  used  in  ring  networks  (e.g. , Farber's  DCS  [Farb  72]  ). 
This  gives  each  processor  two  neighbors.  The  path  of  connections  goes 
around  a circle.  With  unidirectional  paths,  the  maximum  number  of 
routing  steps  required  to  pass  information  from  one  processor  to  another 
is  N-1.  With  bidirectional  paths,  the  worst  case  is  (N/2>-l.  For  a number 
of  parallel  algorithms,  (e.g.,  Fourier  transforms,  sorting,  and  matrix 
transposition)  the  ideal  connection  is  a "perfect  shuffle"  connection.  The 
information  paths  between  processors  follow  the  pattern  that  cards  fall  in 
when  they  are  perfectly  shuffled.  If  a deck  of  cards  falls  in  when  it  is 
perfectly  shuffled,  the  cards  would  be  divided  into  two  equal  piles  and  then 
intermingled  so  that  the  first  card  of  the  first  half  was  first,  the  first  card 
of  the  second  half  was  second,  the  second  card  of  the  first  half  was  third, 
etc.  The  perfect  shuffle  connection  is  not  optimal  for  several  other  appli- 
cations. Because  solving  partial  differential  equation  requires  iterations 
that  effectively  update  a grid  point  as  a function  of  the  values  of  nearby 
grid  points,  Slotnick  proposed  the  array  structure  of  interconnection  now 
used  in  the  ILLIAC  IV,  Each  processor  has  four  neighbors,  and  the  paths 
are  bidirectior.al. 
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A complete  interconnection  network,  where  each  processor  is  directly 
connected  to  every  other  processor,  is  the  most  flexible  connection  scheme, 
but  the  cost  required  for  N processors  increases  proportionally  with  the 
square  of  N (as  stated  before).  Thus  for  large  N,  this  is  not  a realistic 
connection  scheme.  The  best  design  would  emphasize  connections  most 
widely  used  in  SIMD  algorithms  in  comparison  with  SISD  algorithms. 

As  a rule,  the  SIMD  machines  do  not  require  multiprogramming  techniques 
to  achieve  high  throughput. 

Quite  often,  associative  techniques,  such  as  those  used  in  PEPE  (to  be 
explained  later),  are  employed.  The  STARAN  is  another  example  of  an 
associative  SIMD  machine — the  only  one  commercially  available.  [ STAR  75l 

Fault  tolerance  can  be  incorporated  into  this  architecture  fairly  easily. 

When  an  element  or  elements  fail,  their  load  may  be  picked  up  by  others 
until  repairs  can  be  made.  By  distributing  the  processing  and  the  control, 
single  failures  are  usually  not  catastrophic.  A good  distribution  will  allow 
a good  MTBF  and  a minimal  MTTR,  resulting  in  a much  more  reliable, 
useful  system. 

ILLIAC  rV--ILLIAC  IV  is  a SIMD  Array  Processor.  It  fully  implemented 
today,  it  would  be  the  world's  largest  processor.  Unfortunately  only  one- 
fourth  of  ILLIAC's  processing  elements  were  built.  This  was  due  to  both 
the  inability  of  certain  semiconductor  manufacturers  to  keep  promises 
and  also  due  to  software  limitations. 


The  processing  elements  are  grouped  into  8x8  arrays;  each  array  is 
called  a quadrant.  There  is  a single  control  unit  for  each  quadrant,  and 
each  array-control  unit  set  was  to  have  communicated  with  the  other  three 
in  a ring.  Within  the  8x8  array,  each  individual  processing  element  (PE) 
is  connected  to  its  four  nearest  neighbors.  A processing  element  is 
essentially  a general-purpose  computer  with  most  of  the  control  outboard 
(i.  e, , in  the  control  unit  for  the  array).  Since  all  PE's  execute  their 
instructions  in  parallel,  most  control  signals  are  generated  remotely  in 
the  control  units  (CU),  thereby  considerably  reducixig  PE  implementation 
costs.  Figure  4 shows  ILLIAC  as  it  was  supposed  to  be  built,  and  Figure  5 
illustrates  the  PE  and  CU  structure. 

The  ILLIAC  IV  has  almost  no  operating  system.  A user  takes  hold  of  the 
system,  runs  his  problem,  and  then  lets  it  go;  the  next  user  does  the  same 
thing.  In  the  smaller  computers  (PDP-lls)  that  surround  ILLIAC's  control 
unit  and  processing  elements,  there  is  more  complex  software  that  attracts 
a queue  of  users  waiting  to  get  at  the  big  machine  and  allows  them  to  per- 
form nonarithmetic  companion  processes.  But  the  actual  ILLIAC  operating 
software  itself  is  very  simple. 

ILLIAC  is  used  for  a variety  of  problems  for  which  virtually  no  other 
machine  is  suitable  today.  The  NASA  Ames  users,  who  now  take  up  about 
20  percent  of  ILLIAC  IV's  operating  time,  are  solving  two-dimensional 
aerodynamic  flow  equations,  (These  involve  the  solution  of  complicated 
partial  differential  equations, ) The  remaining  users  are  running  weather 
prediction  and  climate  models,  and  several  types  of  signal  processing 


computations,  including  fast  fourier  transforms.  Other  applications 
include  solving  problems  involved  in  beam  forming  and  convolution,  seismic 
research,  and  radiation  transport  for  fission  reactors. 

The  original  design  of  ILLIAC  IV  with  its  four  arrays  was  supposed  to 
perform  1000  million  floating  point  operations  ("megaflops")  per  second. 
Expectations  dropped  when  the  machine  was  limited  to  one  processor 
array,  to  about  200  megaflops.  This  goal  has  not  been  realized  yet,  though 
ILLIAC  IV  does  currently  achieve  15  megaflops.  The  CDC  7600  at  Ames 
has  a theoretical  limit  of  10  megaflops  but  has  only  been  run  at  5 megaflops. 

PEPE--PEPE  is  an  associatively  addressed  ensemble  of  parallel  processing 
elements.  [CRAN72]  It  uses  associative  access  methods  to  store  or  re- 
trieve data  from  a processing  element.  There  are  no  direct  communica- 
tion paths  between  elements,  as  was  the  case  in  ILLIAC  IV,  except  through 
the  control  unit;  for  this  reason,  PEPE  is  termed  an  ensemble.  Because 
the  data  base  that  PEPE  works  on  represents  objects  that  interact  very  little 
with  each  other,  there  is  no  need  for  direct  intercommunication. 

Figure  6 shows  a very  high  level  diagram  of  PEPE,  whereas  Figure  7 
shows  a next  lower  level  view  of  a processing  element  (PE).  The  final 
PEPE  design  has  288  elements,  but  the  currently  implemented  system 
has  only  11  elements. 

PEPE  is  a parallel  processor  designed  to  perform  real-time  radar 
tracking.  It  was  designed  specifically  to  augment  (offload)  a general- 
purpose  machine  (e.g.,  CDC  7600)  in  a ballistic  missile  defense  environ- 
ment. The  application  speed  requirements  dictate  the  ability  to  input. 
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Figure  6.  PEPE  Block  Diagram 
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process,  and  output  data  simultaneously.  This  requirement  is  met  by 
simultaneous  operations  in  each  PE.  PEPE  can  be  inputting  data  through 
the  Correlation  Unit/  Correlation  Control  Unit  Complex  (CU/CCU),  comput- 
ing on  tracks  in  the  Arithmetic  Unit/  Arithmetic  Control  Unit  Complex 
(AU/ACU),  and  outputting  through  the  Associative  Output  Unit/  Associative 
Output  Control  Unit  Complex  (AOU/AOCU),  as  seen  in  Figure  8. 

The  PEPE  prototype  has  16  processing  elements  arranged  in  two  wings  of 
eight  processing  elements  each.  Each  processing  element  consists  of  an 
arithmetic  unit,  a correlation  unit,  an  associative  output  unit,  and  a 
memory.  A global  control  unit,  consisting  of  an  arithmetic  control  unit, 
a correlation  control  unit,  and  an  associative  output  control  unit,  serves 
to  couple  the  ensemble  of  processing  elements  to  the  host.  There  is  no 
interprocessor  communication. 

1 

1 

1 

The  processing  elements  perform  arithmetic  processes,  such  as  track  up- 
dating and  track  prediction.  Only  "active"  processing  elements  will 
perform  the  current  instruction  broadcast  by  the  controller,  and  all  the 
inactive  processing  elements  will  remain  idle.  Each  processing  element 
contains  an  8-bit,  content-addressable  tag  register  together  with  the 
contents  of  the  pi'ocessing  element.  Using  this  ability,  all  object  ranges 
can  be  simultaneously  searched  to  find  maximum,  minimum,  and  inbe- 
tween  limits.  These  processes  can  be  combined  to  perform  ordered 
retrieval. 
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The  associative  output  unit  and  the  associative  output  control  unit  were 
added  to  the  initial  design  in  order  to  meet  I/O  considerations  not  initially 
anticipated.  The  operation  is  similar  to  that  of  the  arithmetic  units,  but 
for  the  lack  of  arithmetic  ability.  Its  function  is  data  sorting,  and  it  is 
the  primary  means  of  system  output. 

The  correlation  unit  is  essentially  a content-addressable  memory.  It  has 
eight  words  of  40  bits  each  in  each  unit.  Used  for  associative  comparisons 
of  incoming  data,  it  is  only  able  to  handle  integer  operations.  Its  instruc- 
tion repertoire  consists  primarily  of  associative  memory  match  and  store 
instructions. 


The  PEPE  can  best  be  described  as  an  associatively  organized,  highly 
parallel  computer.  It  is  a SIMD  machine  like  the  ILLIAC  IV,  but  unlike 
the  ILLIAC  IV,  it  does  not  have  direct  processor  interaction. 

An  operation  on  a pair  of  operands  will  take  longer  on  a content  address- 
able parallel  process  than  on  a conventional  machine,  because  it  will  be 
serial  by  bit  rather  than  parallel  by  bit.  But  if  the  same  operation  is  to 
be  done  on  N pairs  of  operands,  the  conventional  machine  will  take  N 
times  as  long  as  it  does  for  one  pair  where  the  time  for  the  content 
addressable  parallel  processor  will  be  the  same  as  for  one  pair  of  operands. 


On  a conventional  machine,  finding  the  largest  element  of  an  array 
requires  time  of  order  N,  while  in  the  content  addressable  parallel 
processor  it  requires  time  of  order  one  (independent  of  N).  The  time 
required  to  compare  N elements  with  each  other  is  of  the  order  of  N 
squared  in  a conventional  machine  while  in  the  content  addressable  parallel 
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processor  it  is  on  the  order  of  N.  In  general,  for  most  problems  it  is 
fair  to  say  that  content  addressable  parallel  processors  reduce  the 
complexity  of  the  problem  by  an  order  of  N.  [FOST  76] 


SIMP  Drawbacks --The  basic  constraint  on  a SIMP  machine  is  that  each 
memory  module  can  satisfy  only  one  request  per  memory  cycle.  For  best 
operation,  the  N operands  of  a vector  operation  will  lie  in  distinct  memory 
modules.  This  presents  an  interesting  problem  when  storing  a matrix  in 
a SIMP  computer  made  with  the  processing  elements  connected  in  an  array 
structure.  It  is  clear  that  for  a matrix  to  be  operated  on  row -by-row, 
the  most  efficient  storage  of  that  matrix  would  associate  each  element  of 
each  row  with  a different  processing  element.  If  a matrix  is  to  be  operated 
on  colunon-wise,  it  is  necessary  to  have  each  column  element  in  a separate 
processor.  For  this  reason,  matrices  in  SIMP  machines  are  stored  in  a 
storage  format  known  as  skewed  storage.  Successive  rows  are  cyclically 
shifted  by  unit  amounts  so  each  column  and  each  row  is  distributed  among 
several  memory  elements.  With  skewed  storage,  both  the  rows  and  the 
columns  of  the  matrix  can  be  addressed  in  unit  time. 

SIMP  architectures  are  typically  very  compute -bound,  special-purpose 
machines.  They  are  best  suited  for  problems  using  iterative  algorithms 
(e.g.,  solving  boundary  value  problems)  typical  to  weather  bureau,  nuclear 
reactor,  and  target  tracking  signal  processing  applications.  They  are  not 
useful  for  general  applications  because  of  the  difficulty  in  finding  parallel 
algorithms  suitable  for  an  array  processor.  A great  deal  of  this  problem 
lies  in  the  years  the  industry  has  spent  designing  serial  algorithms,  while 
ignoring  parallel  algorithms. 
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A large  number  of  parallel  machines  have  been  proposed  over  the  last  20 
years,  but  only  three  designs  have  been  built:  ILLIAC  IV,  PEPE,  and  OMEN 
by  Sanders  Associates.  The  first  two  machines  have  been  described. 


MISD  Machines  . 

MISD  machines  are  commonly  called  pipeline  machines.  Pipeiined  pro- 
cessors are  well  suited  for  processing  vectors  and  arrays.  In  fact,  many 
of  the  Iarg»  machines  using  extensive  pipelining  also  feature  vector  instruc- 
tion capability.  Peripheral  processors  are  also  available  for  use  with  a 
host  general-purpose  computer  to  give  vector  processing  capabilities,  as 
well  as  microprogrammed  routines  for  functions  useful  in  signal  processing 
applications  (e.g.  , the  IBM  3838,  a stored  program  computer  usually  used 
as  an  attachment  with  a System  370). 

1 

! 

1 

Traditional  computers  execute  scalar  instructions  with  two  source  and  one 
destination  operands.  A machine  with  a vector  instruction  set  operates  on 
operands  which  are  vectors  and  may  contain  hundreds  or  thousands  of 
elements.  In  a vector  instruction,  each  word  of  the  vector  will  undergo 
the  same  operation.  Vector  instructions  may  operate  on  two-vector 
operands  or  on  a one-vector  operand  (e.g. , sum  the  elements  of  a vector). 
Vector  operations,  when  implemented  on  a scalar  machine  (i.  e. , SISD), 
will  typically  take  the  form  of  a loop  such  as: 

DO  10  I = 1,500 
10  A (I)  = B(I)  + C(I) 

This  can  be  reduced  to  a single  statement  on  a vector  machine. 
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A vector  machine  may  be  considered  a special  case  of  a pipeline  or  vice- 
versa.  The  CDC  6600  and  CDC  7600  utilize  intermediate  storage  fo*  all 
operands,  only  allowing  one  instruction  to  use  a functional  unit  at  a time. 
Their  architectures  and  applications  led  to  the  CDC  STAR,  which  streams 
operands  and  the  CRAY-1,  which  chains  operations.  These  machines  are 
important  in  that  they  can  be  used  on  scientific  problems  in  which  the 
scalar  machines  would  be  completely  bogged  down.  But  the  problem  is 
that  most  machines  specialized  for  vector  operations  are  not  necessarily 
good  at  scalar  operations. 

A recent  study  done  at  Los  Alamos  Scientific  Laboratories  [RUDS77]  has 
determined  that  unless  a vector  machine  in  scalar  mode  can  at  least  compete 
with  the  scalar  machines,  they  are  not  worthy  of  consideration 
for  anything  but  highly  specialized  applications.  Ruds inski  and  Worlton 
state  that: 

"Figure  9 shows  the  performance  bounds  for  three  different  ratios  of 
scalar  performance  in  the  machines  being  compared.  The  values  on  the 
ordinate  are  the  ratio  of  the  performance  of  the  scalar-vector  machine 
to  the  existing  scalar  machine.  The  first  scale  indicates  that  both  machines 
perform  scalar  operations  equally  well.  Note  that  even  for  an  infinitely 
fast  vector  processor  the  performance  gain  from  50  percent  vector  work  is  at 
most  a facto*  of  two;  the  performance  gain  from  75  percent  vector  work  is  at 
most  a factor  of  four;  etc.  This  limitation  on  the  performance  gain  is 
entirely  due  to  the  scalar  work  that  remains  after  the  time  to  complete  the 
vector  work  is  set  to  zero. 
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Figure  9.  Maximum  Performance  Gain  from  Vectorization 

"The  second  scale  in  Figure  9 illustrates  the  situation  where  the  scalar 
speed  of  the  scalar-vector  computer  is  significantly  slower  than  that  of 
tha  existing  machine.  In  particular,  this  ordinate  displays  the  case  where 
the  scalar  mode  of  the  new  machine  is  one -fourth  that  of  the  existing 
machine.  Note  that  we  must  have  75  percent  vector  results  just  to  "catch  up" 
with  the  performance  of  the  existing  scalar  machine.  Because  we  are 
also  assuming  infinite  vector  speed,  in  a real  machine  with  these  scalar 
characteristics,  vector  results  will  have  to  be  somewhat  higher  than  75  percent 
to  "break  even,"  This  illustrates  the  severe  penalty  of  incorporating  a 
slow  scalar  processor  into  a scalar-vector  architecture. 
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"The  third  scale  in  Figure  9 shows  the  effect  of  speeding  up  the  scalar 
processor  in  a scalar-vector  architecture  by  a factor  of  two.  Clearly, 
the  effect  of  scalar  performance  on  scalar-vector  architecture  is  profound.  " 

Vector  machines  can  offer  an  order  of  magnitude  or  better  increase  in 
speed  in  vector  mode  over  a scalar  machine.  The  CRAY-1  is  the  first 
vector  machine  to  produce  enough  scalar  performance  together  with  vector 
performance  to  be  successful,  according  to  the  Los  Alamos  study, 

STAR- 100 --The  STAR-100  has  vector  operations  included  in  its  instruction 
set.  The  instructions  specify  operations  on  variable  length  streams  of 
data  which  allow  full  use  of  the  memory  bandwidth  (512  bits  per  minor 
cycle  of  40  nanoseconds)  and  arithmetic  pipeline  (capable  of  producing 
100  million  32 -bit,  of  50  million  64 -bit,  floating  point  results  per  second). 

The  heart  of  the  central  processor  is  the  stream  control  unit.  The  stream 
control  unit  includes  an  instruction  control  unit  that  has  a stack  of  64 
instructions  filled  via  a hardware  look-ahead  of  eight  to  sixteen  instructions, 
i.e. , it  can  prefetch  a block  of  instructions  while  processing  takes  place 
from  the  stack. 

The  STAR-100  stream  unit  also  has  a 256 -word  register  file,  addressed  as 
the  lowest  part  of  main  memory,  for  temporary  sources  and  destinations  for 
register  operations.  The  stream  control  unit  controls  the  arithmetic  unit 
and  the  string  unit.  The  floating  point  arithmetic  unit  is  divided  into  two 
units  which  can  operate  in  parallel;  one  pipeline  for  add,  subtract,  multiply, 
and  similar  vector  instructions;  and  a second  pipeline  for  divide,  square 
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root,  and  vector  instructions.  The  string  unit  is  for  operating  on  strings 
I . of  binary  and  decimal  numbers,  i.e..  variable  length  operands  such  as 

those  required  for  character  string  operations.  All  pipelines  have  a minor 
! , cycle  time  of  40  nanoseconds. 

Three  channels  from  the  memory  to  the  arithmetic  unit  allow  two  vectors 
to  be  inserted  into  the  top  of  the  pipeline  and  a result  stream  to  be  sent 
' back  to  memory,  all  simultaneously.  An  inhibit  bit  stream  may  also  be 

used  to  inhibit  the  designated  operation  on  particular  elements  of  a vector. 

The  STAR-100  has  approximately  130  scalar  and  65  vector  instructions.  It 
is  currently  used  at  Lawrence  Livermore  Radiation  Laboratories.  Like 
many  of  the  supercomputers,  it  was  several  years  late  in  its  delivery. 

TEXAS  lNSTRUMENTS-ASC--The  TI-ASC  central  processor  is  built 
around  an  instruction  processing  unit  (IPU).  This  preprocesses  the 
instruction  stream,  which  typically  contains  a mixture  of  scalar  and  vector 
instructions.  The  IPU  is  pipelined,  with  four  segments  to  the  pipeline. 
These  are  controlled  by  the  IPU. 

The  central  processor  may  have  one,  two,  three,  or  four  pipelined  execu- 
tion units.  Each  unit  can  have  up  to  eight  instructions  in  different  stages 
of  execution  in  the  pipeline. 

The  central  processor  has  sixteen  base  address  registers,  sixteen  arith- 
metic registers,  eight  index  registers  and  eight  vector  parameter  registers. 
These  last  registers  are  used  to  extend  the  instruction  format  for  the  com- 
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plete  specifications  of  vector  instructions.  They  are  the  vector  control 
registers  which  contain  the  parameters  (vector  control  blocks)  that 
describe  vectors  to  be  operated  on.  The  vector  parameters  can  provide 
three  dimensions  of  addressability.  This  corresponds  to  the  nesting  of 
three  index  loops  in  FORTRAN  for  a scalar  operation  machine  and  can 
be  used  to  process  a three-dimensional  array. 

The  vector  capabilities  of  the  control  processor  are  made  available  through 
the  use  of  VECTL  (vector  operation  after  loading  the  vector  parameters) 
and  VECT  (which  assumes  the  parameters  are  already  loaded)  instructions. 
The  Tl-ASC  has  177  scalar  and  70  vector  operations  that  it  can  perform. 

The  processor  has  operand  look-ahead  capability,  which  requests  memory 
references  prior  to  the  time  of  actual  need.  Each  arithmetic  pipeline  has 
a memory  buffer  unit  that  provides  smooth  data  flow  to  and  from  each  unit. 
The  X and  Y operand  streams  and  the  Z result  stream  are  all  double- 
buffered.  A TI-ASC  buffer  is  an  eight-word  set  of  data,  used  because  the 
TI-ASC  fetches  and  stores  data  in  groups  of  eight  words,  so  that  each 
stream  has  a sixteen-word  buffer  associated  with  it. 

The  TI-ASC,  as  all  vector  machines,  reaches  its  maximum  throughput 
when  in  vector  mode.  Each  pipelined  arithmetic  unit  can  produce  a result 
every  60  nanoseconds.  A processor  with  four  pipelined  arithmetic  units 
can  produce  four  results  each  60  nanoseconds,  or  an  effective  rate  of  one 
result  every  15  nanoseconds. 


38 


Because  the  TI-ASC  can  be  configured  with  multiple  pipelines,  the  Tl-ASC 
compiler  software  is  responsible  both  for  the  generation  of  vector  instruc- 
tions and  the  partitioning  of  vector  operations  over  multiple  pipelines.  For 
mixtures  of  vector  and  scalar  instructions,  the  compiler  enforces  proper 
precedence  through  the  use  of  directives  for  the  central  processor  to 
operate  in  parallel  mode  (FORK)  or  sequential  mode  (JOIN).  Programs 
compiled  for  a one -pipeline  ASC  will  execute  correctly  on  a multiple  pipe- 
line system,  but  performance  will  be  increased  via  recompilation  for  the 
multiple  pipeline  machine. 

There  are  several  TI-ASC  machines  in  use  today.  The  first  model  is  now 
being  used  for  software  development.  The  second  model  is  now  in  Amstel- 
veen,  Holland,  devoted  to  seismic  data  processing  work.  Seismic  operation 
is  characterized  by  large  data  bases  and  many  job  steps  of  long 
computational  sequences.  The  third  model  is  being  used  by  the  U.  S.  Army 
for  ballistic  missile  defense.  The  fourth  model  is  used  by  the  National 
Oceanic  and  Atmospheric  Administration.  Preliminary  reports  of  the 
performance  of  this  last  system  on  a weather  benchmark  indicate  it  is 
several  times  faster  than  any  conventional  computer  systems;  it  is  about 
five  times  faster  than  a CDC  7600  or  an  IBM  360  model  195,  and  about 
forty  times  faster  than  an  IBM  360  model  65. 

CRAY-1 


The  computation  section  of  the  CRAY-1  consists  of  an  instruction  control 
unit,  operating  registers,  and  functional  units.  The  instruction  contirol 
performs  all  decisions  related  to  instruction  issue,  and  coordinates  the 


three  types  of  processing:  vector,  scalar,  and  address.  Associated  with 
each  type  of  processing  are  registers  and  functional  units  for  that  process- 
ing mode.  For  vector  processing  there  are: 

(1)  a set  of  64 -bit,  64 -element  vector  registers; 

(2)  three  functional  units  dedicated  to  vector  applications:  add, 
logical,  and  shift;  and 

(3)  three  floating  point  functional  units  to  support  both  scalar  and 
vector  operation;  add,  multiply,  and  reciprocal  approximation, 
the  latter  being  used  for  division. 

The  flow  of  data  is  generally  from  memory  to  registers,  then  registers  to 
functional  units.  The  flow  of  results  is  from  functional  units  to  register, 
and  then  from  register  to  memory,  or  back  to  functional  unit.  £>ata  flows 
either  on  a scalar  or  a vector  path,  the  one  exception  being  scalar  registers, 
which  can  be  used  as  one  of  the  operands  required  for  vector  operations 
(e.g.,  adding  a constant  to  each  element  of  a vector). 

In  contrast  to  the  TI-ASC  and  STAR -100  in  which  scalars  and  vectors  share 
operand  and  result  registers,  and  all  operations  share  the  same  arithmetic 
units,  the  CRAY-1  has  vector  functional  units  that  operate  solely  on  vectors 
and  eight  vector  registers  dedicated  only  to  vector  operands,  each  with 
64  words  of  64  bits.  The  successive  elements  within  a vector  register  are 
considered  to  make  up  a vector. 

Each  element  of  a vector  is  processed  identically.  A vector  operation 
begins  by  taking  the  first  elements  of  one  or  more  vector  registers  and 
delivering  the  result  to  the  first  element  of  a vector  register.  Successive 
elements  can  be  provided  each  clock  period  (12.5  nanoseconds).  The 


vector  instruction  continues  until  the  number  of  operations  performed 
equals  a covuit  specified  by  a vector  length  register. 

Results  are  obtained  from  the  vector  functional  units  every  clock  pulse  of 
12.5  nanoseconds.  A result  will  be  received  by  a previously  identified 
register  and  may  be  received  by  a register  and  transmitted  as  an  operand 
to  a subsequent  operation  in  the  same  clock  period.  This  allows  chaining 
of  vector  operations. 

The  CRAY-1  can  achieve  parallel  vector  processing  in  two  modes,  either 
several  completely  parallel  vector  operations  using  different  functional 
units,  or  by  using  one  vector  result  stream  as  the  operand  stream  for 
another  separate  functional  unit  that  is  operating  simultaneously. 

There  is  a vector  mask  register,  64  bits  long,  used  in  conjunction  with 
vector  merge  and  test  instznictions,  allowing  operations  to  be  performed 
on  selected  vector  elements.  Each  bit  of  the  vector  mask  corresponds  to 
a word  element  in  a vector  register. 

The  CRAY-1  is  the  fastest  number  cruncher  available  today.  It  has  no 
assembler  or  peripherals  and  depends  on  a host  for  operation  (although  a 
CRAY-1  processes  faster  than  a CDC-7600  can  supply  it  with  information). 
It  has  a large,  very  fast  memory  (one  million  words  of  bipolar  memory) 
with  a memory  cycle  time  of  50  nanoseconds  and  requires  125  nanoseconds 
to  fetch  operands  from  memory  to  the  registers.  When  operating  on 
vectors,  it  can  supply  a register  with  an  operand  every  12.  5 nanoseconds. 
The  bipolar  memory  requires  that  the  CRAY-1  be  freon  cooled. 
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The  CRAY  costs  close  to  $10  million,  which  is  considerably  more  than  the 
$6  million  necessary  to  buy  a STAR,  which  can  handle  its  own  peripherals 
and  has  a FORTRAN  compiler.  At  best,  the  CRAY-1  can  deliver  results 
twice  as  fast  as  the  STAR.  Grosch's  law  (that  the  computing  power 
increases  as  the  square  of  the  cost)  seems  to  break  down  for  very  large 
supercomputers . 

Evaluation  of  performance  supplied  by  Los  Alamos  Scientific  Laboratory 
shows  that  the  CRAY-1  is  from  3.4  to  5.1  times  faster  than  a CDC  7600 
(a  machine  with  a theoretical  maximum  output  of  ten  million  results  per 
second)  in  vector  operations  on  vectors  ranging  from  20  to  500  words  in 
length.  Also,  system  availability  and  mean  time  to  failure  were  found  to  be 
well  above  that  of  the  7600. 


SECTION  4 


CONCLUSIONS 


In  the  past,  the  increase  in  speed  and  size  of  computer  systems  was  due 
in  large  part  to  componentry.  There  is  still  close  to  an  order  of  magnitude 
Increase  in  speed  expected  from  technological  advances  in  electronics. 

Other  techniques  under  development  will  also  speed  up  computers  of  the 
future.  Super-cooled  Josephson  junctions  are  expected  to  reach  the  speed 
of  0.02  nanoseconds.  Electro -optical  logic  elements  have  already  been 
constructed,  and  the  speed  of  operation  is  expected  to  be  below  0.01  nano- 
seconds. Another  potential  for  speed  lies  in  elastic  surface -wave  technol- 
ogy,  where  switching  speeds  may  reach  below  0.01  nanoseconds. 

Yet  the  speed  of  hardware  alone  will  not  be  able  to  satisfy  all  of  the  demands 
put  on  computers  in  the  future.  It  is  clear  from  the  examples  presented 
that  architectural  design  techniques  can  and  will  play  a large  role  in 
determining  the  performance  of  computer  systems.  These  techniques 
will  become  more  prevalent  in  future  machines. 

The  architectural  concepts  presented  are  not  limited  to  large  scale,  special- 
purpose  architectures.  They  can  be  applied  to  general-purpose  machines 
and  can  be  applied  even  to  small-scale  machines.  And  they  can  be  used 
without  ever  having  to  buy  a fancy  processor,  e.g.,  the  Ramseyer-van 
Dam  multi -processor  machine.  [RAMS  77] 


The  explosive  growth  of  LSI  and  VLSI  will  have  an  enormous  effect  on 
micro-,  mini-,  conventional,  and  supercomputers.  Each  computer  will 
have  its  own  use,  eind  each  application  will  have  a computer  suited  for  it. 
Many  of  the  techniques  discussed  will  appear  in  all  of  these  machines.  The 
only  difference  will  be  in  the  amount  of  cache,  parallelism,  pipelining,  or 
associative  memory  present  in  each  machine. 

The  cost  of  computers  should  decrease  dramatically  with  LSI.  It  is  esti- 
mated that  a processor  with  the  same  processing  power  as  the  STAR-100 
will  cost  about  $50,000  by  the  middle  of  the  1980's,  a cost  reduction  by  a 
factor  of  120. 

Many  techniques  will  be  combined.  In  fact,  as  the  size  and  cost  of 
components  decrease,  there  will  be  an  increase  in  the  number  of  machines 
combining  a variety  of  architectural  techniques.  For  example,  most 
large  machines  already  have  some  content-addressable  memory  to  handle 
memory  paging.  The  CDC  6600  and  the  TI-ASC  combine  special-purpose 
processor  units  with  a multiprocessor  peripheral  control.  Distributed 
processors  are  being  constructed  using  tightly  coupled  networks  of  minis 
or  micros,  with  collections  of  processors  configured  in  a pipeline. 

[FULL  73],  [FULL  75],  [RAMS  77] 

Reliability  of  components  should  continue  to  increase.  The  availability  of 
computer  systems  will  reflect  the  increase  in  the  reliability  of  components; 
as  size  and  cost  decrease,  there  will  be  replication  of  units  for  the 
purpose  of  making  computers  fault-tolerant. 
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Complete  processors  are  already  available  on  one  chip.  When  several 
processors  are  available  on  one  chip,  the  creation  of  array  processors 
and  even  array  processors  with  backup  processing  elements  will  be  very 
easy  and  inexpensive. 

The  rapid  advances  in  computing  power  will  continue  for  some  time.  And 
as  the  alternatives  available  become  greater  in  number,  architectural 
design  techniques  will  play  an  increasing  role  in  the  development  of  com- 
puter systems.  In  terms  of  military  computers,  the  trend  will  no  doubt 
go  towards  emulation  engines  implemented  out  of  microprogrammable 
microprocessors  (e.g.,  AM2901,  M10800).  These  sorts  of  engines  then 
may  be  used  in  place  of  current  standard  mil  computers  like  the  AN/UYK- 
20,  but  will,  in  any  case,  be  software-compatible  with  one  or  more 
standard  computers  at  least  in  one  emulation  mode. 

Much  processing  currently  done  by  smaller  SISD  machines  will  be  parsed 
and  distributed  to  networks  of  microprocessors  and/or  microcomputers  to 
achieve  more  cost-effective,  faster,  and  reliable  systems. 

Today's  general-purpose,  large-scale  main  frames  also  will  eventually 
follow  this  basic  trend  and  end  up  as  a set  of  smaller  special-purpose 
processors  which  themselves  are  implemented  via  a third  generation 
microprocessor  or  special  purpose  hard  wired  processors. 
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